Fault Tolerant Wide-Area Parallel Computing

نویسنده

  • Jon B. Weissman
چکیده

Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as a means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited for the wide-area network environment. The results also show that it possible to accurately model and predict the overheads of the two methods1

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fault Tolerant Scheduling in Distributed Networks

We present a model for application-level fault tolerance for parallel applications. The objective is to achieve high reliability with minimal impact on the application. Our approach is based on a full replication of all parallel application components in a distributed wide-area environment in which each replica is independently scheduled in a different site. A system architecture for coordinati...

متن کامل

FIRE Ant with Activity Structures: a Biologically Motivated Approach to Fault-Tolerant Routing in Global Networks

– This paper presents a new approach to fault-tolerant routing for large-scale distributed parallel computing that will allow communications to continue to take place in the presence of many faults, while maintaining low overhead on the system. FIRE ANT algorithm uses its emergent behavior in order to make intelligent decisions (avoidance) when multiple failures exist between Autonomous Systems...

متن کامل

An approach to fault detection and correction in design of systems using of Turbo ‎codes‎

We present an approach to design of fault tolerant computing systems. In this paper, a technique is employed that enable the combination of several codes, in order to obtain flexibility in the design of error correcting codes. Code combining techniques are very effective, which one of these codes are turbo codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...

متن کامل

MPI/FT: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing

MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault -tolerant MPI middleware. Environments include space -based, wide -area/web/meta computing, and scalable clusters. MPI/FT , the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirem...

متن کامل

MPI/FTTM: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing

MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000